Parser for unified. Parses markdown to an
MDAST syntax tree. Used in the remark
processor. Can be extended to change how
markdown is parsed.
Installation
npm:
npm install remark-parse
Usage
var unified = require('unified');
var createStream = require('unified-stream');
var markdown = require('remark-parse');
var html = require('remark-html');
var processor = unified()
  .use(markdown, {commonmark: true})
  .use(html);

process.stdin
  .pipe(createStream(processor))
  .pipe(process.stdout);
API
processor.use(parse[, options])
Configure the processor to read markdown as input and process an
MDAST syntax tree.
options
Options are passed directly, or passed later through processor.data().
options.gfm
hello ~~hi~~ world
GFM mode (boolean, default: true) turns on:
- Fenced code blocks
- Autolinking of URLs
- Deletions (strikethrough)
- Task lists
- Tables
options.commonmark
This is a paragraph
and this is also part of the preceding paragraph.
CommonMark mode (boolean, default: false) allows:
- Empty lines to split blockquotes
- Parentheses, ( and ), around link and image titles
- Any escaped ASCII-punctuation character
- Closing parenthesis, ), as an ordered list marker
- URL definitions (and footnotes, when enabled) in blockquotes
CommonMark mode disallows:
- Code directly following a paragraph
- ATX-headings (# Hash headings) without spacing after opening hashes
  or before closing hashes
- Setext headings (Underline headings\n---) when following a paragraph
- Newlines in link and image titles
- White space in link and image URLs in auto-links (links in angle
  brackets, < and >)
- Lazy blockquote continuation, lines not preceded by a closing angle
  bracket (>), for lists, code, and thematicBreak
options.footnotes
Something something[^or something?].
And something else[^1].
[^1]: This reference footnote contains a paragraph...
    * ...and a list
Footnotes mode (boolean, default: false) enables reference footnotes and
inline footnotes. Both are wrapped in square brackets and preceded by a
caret (^), and can be referenced from inside other footnotes.
options.blocks
<block>foo
</block>
Blocks (Array.<string>, default: list of block HTML elements)
lets users define block-level HTML elements.
options.pedantic
Check out some_file_name.txt
Pedantic mode (boolean, default: false) turns on:
- Emphasis (_alpha_) and importance (__bravo__) with underscores
  in words
- Unordered lists with different markers (*, -, +)
- If commonmark is also turned on, ordered lists with different
  markers (., ))
- Removal of fewer spaces in list items (at most four, instead of the
  whole indent)
parse.Parser
Access to the parser, if you need it.
Extending the Parser
Most often, using transformers to manipulate a syntax tree produces
the desired output. Sometimes, mainly when introducing new syntactic
entities with a certain level of precedence, interfacing with the parser
is necessary.
If the remark-parse plugin is used, it adds a Parser constructor to
the processor. Other plugins can add tokenizers to the parser’s prototype
to change how markdown is parsed.
The below plugin adds a tokenizer for at-mentions.
module.exports = mentions;

function mentions() {
  var Parser = this.Parser;
  var tokenizers = Parser.prototype.inlineTokenizers;
  var methods = Parser.prototype.inlineMethods;

  // Register the inline tokenizer (defined in a later example).
  tokenizers.mention = tokenizeMention;

  // Run it just before the text tokenizer.
  methods.splice(methods.indexOf('text'), 0, 'mention');
}
Parser#blockTokenizers
An object mapping tokenizer names to tokenizers. These tokenizers
(for example: fencedCode, table, and paragraph) eat from the start of
a value to a line ending.
See #blockMethods below for a list of methods that are included by
default.
Parser#blockMethods
Array of blockTokenizers names (string) specifying the order in
which they run.
newline
indentedCode
fencedCode
blockquote
atxHeading
thematicBreak
list
setextHeading
html
footnote
definition
table
paragraph
Parser#inlineTokenizers
An object mapping tokenizer names to tokenizers. These tokenizers
(for example: url, reference, and emphasis) eat from the start
of a value. To increase performance, they depend on locators.
See #inlineMethods below for a list of methods that are included by
default.
Parser#inlineMethods
Array of inlineTokenizers names (string) specifying the order in
which they run.
escape
autoLink
url
html
link
reference
strong
emphasis
deletion
code
break
text
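Because the order of this list decides which tokenizer gets the first chance at a value, a new tokenizer name is usually spliced in rather than pushed onto the end. The following is a minimal sketch on a plain array that mirrors the default list above; it does not touch a real Parser, and the name mention is only the running example from this document.

```javascript
// Plain-array sketch: this mirrors the default inlineMethods order above.
// In a plugin, the array would be `this.Parser.prototype.inlineMethods`.
var methods = [
  'escape', 'autoLink', 'url', 'html', 'link', 'reference',
  'strong', 'emphasis', 'deletion', 'code', 'break', 'text'
];

// Insert the example name just before 'text', so that the mention
// tokenizer is tried before the catch-all text tokenizer.
methods.splice(methods.indexOf('text'), 0, 'mention');

console.log(methods.slice(-3)); // [ 'break', 'mention', 'text' ]
```

Pushing the name onto the end instead would place it after text, which by then has most likely consumed the value.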
function tokenizer(eat, value, silent)
tokenizeMention.notInLink = true;
tokenizeMention.locator = locateMention;

function tokenizeMention(eat, value, silent) {
  var match = /^@(\w+)/.exec(value);

  if (match) {
    if (silent) {
      return true;
    }

    return eat(match[0])({
      type: 'link',
      url: 'https://social-network/' + match[1],
      children: [{type: 'text', value: match[0]}]
    });
  }
}
The parser knows two types of tokenizers: block level and inline level.
Block level tokenizers are the same as inline level tokenizers, with
the exception that the latter must have a locator.
Tokenizers test whether a document starts with a certain syntactic
entity. In silent mode, they return whether that test passes.
In normal mode, they consume that token, a process which is called
“eating”. Locators enable tokenizers to function faster by providing
information on where the next entity may occur.
Signatures
Node? = tokenizer(eat, value)
boolean? = tokenizer(eat, value, silent)
Parameters
- eat (Function) - Eat, when applicable, an entity
- value (string) - Value which may start an entity
- silent (boolean, optional) - Whether to detect or consume
Properties
- locator (Function) - Required for inline tokenizers
- onlyAtStart (boolean) - Whether nodes can only be found at the
  beginning of the document
- notInBlock (boolean) - Whether nodes cannot be in blockquotes, lists,
  or footnote definitions
- notInList (boolean) - Whether nodes cannot be in lists
- notInLink (boolean) - Whether nodes cannot be in links
Returns
- In silent mode, whether a node can be found at the start of value
- In normal mode, a node if it can be found at the start of value
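To make the two modes concrete, the sketch below calls the mention tokenizer directly, outside the parser. It repeats the tokenizer so that it runs on its own. The createEat helper is a hypothetical stand-in for the parser's eat, written only for this illustration: it checks that the eaten string starts the value and returns an add function that hands the node back.

```javascript
// Hypothetical stand-in for the parser's `eat` (an assumption, not the
// library's implementation): verifies the subvalue, then returns `add`.
function createEat(value) {
  return function eat(subvalue) {
    if (value.slice(0, subvalue.length) !== subvalue) {
      throw new Error('Cannot eat value that is not at the start');
    }

    return function add(node) {
      return node;
    };
  };
}

function tokenizeMention(eat, value, silent) {
  var match = /^@(\w+)/.exec(value);

  if (match) {
    if (silent) {
      return true; // Silent mode: only report that a mention starts here.
    }

    // Normal mode: consume the matched text and produce a node.
    return eat(match[0])({
      type: 'link',
      url: 'https://social-network/' + match[1],
      children: [{type: 'text', value: match[0]}]
    });
  }
}

var value = '@alice says hi';
var detected = tokenizeMention(createEat(value), value, true);
var node = tokenizeMention(createEat(value), value);
var missed = tokenizeMention(createEat('hi'), 'hi', true);

console.log(detected); // true
console.log(node.url); // https://social-network/alice
console.log(missed); // undefined
```

When nothing matches, the tokenizer returns nothing in either mode.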
tokenizer.locator(value, fromIndex)
function locateMention(value, fromIndex) {
  return value.indexOf('@', fromIndex);
}
Locators are required for inline tokenization to keep the process
performant. They let inline tokenizers run faster by providing
information on where the next entity may occur. Locators may be wrong:
it’s OK if there isn’t actually a node at the index they return, but
they must not skip any nodes.
Parameters
- value (string) - Value which may contain an entity
- fromIndex (number) - Position to start searching at
Returns
Index at which an entity may start, and -1 otherwise.
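As a rough illustration of that contract, the sketch below runs the mention locator on a sample string, next to a second locator for an entity that could start with either of two characters. The second function, locateMentionOrTag, and the # marker are hypothetical and exist only for this example. It returns the nearest candidate, so no possible node is skipped.

```javascript
function locateMention(value, fromIndex) {
  return value.indexOf('@', fromIndex);
}

// Hypothetical locator for an entity that may start with `@` or `#`:
// return the nearest candidate so that no possible node is skipped.
function locateMentionOrTag(value, fromIndex) {
  var at = value.indexOf('@', fromIndex);
  var hash = value.indexOf('#', fromIndex);

  if (at === -1) {
    return hash;
  }

  if (hash === -1) {
    return at;
  }

  return Math.min(at, hash);
}

var sample = 'Hi @alice, see #42 and @bob';

console.log(locateMention(sample, 0)); // 3
console.log(locateMention(sample, 4)); // 23
console.log(locateMention(sample, 24)); // -1
console.log(locateMentionOrTag(sample, 4)); // 15
```

An index that turns out not to start a node is harmless, because the tokenizer simply declines there.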
eat(subvalue)
var add = eat('foo');
Eat subvalue, which is a string at the start of the tokenized value
(it’s tracked to ensure the correct value is eaten).
Parameters
- subvalue (string) - Value to eat.
Returns
add.
add(node[, parent])
var add = eat('foo');
add({type: 'text', value: 'foo'});
Add positional information to node and add it to parent.
Parameters
- node (Node) - Node to patch position on and insert
- parent (Node, optional) - Place to add node to in the syntax tree.
  Defaults to the currently processed node
Returns
The given node.
add.test()
Get the positional information which would be patched on node by add.
Returns
Location.
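How eat, add, and add.test relate can be sketched with a simplified model. Everything in the block below is an assumption made for illustration, not the parser's code: createEat is a hypothetical helper, positions hold only start and end offsets, and a node without a parent is simply returned rather than attached to a current node.

```javascript
// Simplified, hypothetical model of `eat` and `add`. It tracks offsets
// only; it is not the library's implementation.
function createEat(value) {
  var offset = 0;

  return function eat(subvalue) {
    // The eaten string must sit at the start of what is left to tokenize.
    if (value.slice(offset, offset + subvalue.length) !== subvalue) {
      throw new Error('Incorrectly eaten value: ' + JSON.stringify(subvalue));
    }

    var position = {
      start: {offset: offset},
      end: {offset: offset + subvalue.length}
    };

    offset += subvalue.length;

    // Patch the position on `node` and, if given, append it to `parent`.
    function add(node, parent) {
      node.position = position;

      if (parent) {
        parent.children.push(node);
      }

      return node;
    }

    // Report the position without adding anything.
    add.test = function () {
      return position;
    };

    return add;
  };
}

var root = {type: 'root', children: []};
var eat = createEat('foo bar');
var add = eat('foo');

console.log(add.test().end.offset); // 3

var node = add({type: 'text', value: 'foo'}, root);

console.log(root.children.length); // 1
```

In this model, calling eat('bar') next would throw, because the remaining value starts with a space. That mirrors the tracking mentioned in the description of eat.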
add.reset(node[, parent])
Like add, but resets the internal location. Useful, for example, in
lists, where the same content is first eaten for a list, and later
for list items.
Parameters
- node (Node) - Node to patch position on and insert
- parent (Node, optional) - Place to add node to in the syntax tree.
  Defaults to the currently processed node
Returns
The given node.
Turning off a tokenizer
In rare situations, you may want to turn off a tokenizer to avoid parsing
that syntactic feature. This can be done by deleting the tokenizer from
your Parser’s blockTokenizers (or blockMethods) or inlineTokenizers
(or inlineMethods).
The following example turns off indented code blocks:
delete remarkParse.Parser.prototype.blockTokenizers.indentedCode;
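The other route mentioned above, removing the name from the methods list, can be sketched on a plain array. The list below mirrors the default blockMethods order given earlier, and turnOff is a hypothetical helper written for this example only; in a plugin, the array would be Parser.prototype.blockMethods.

```javascript
// Plain-array sketch: this mirrors the default blockMethods order.
var blockMethods = [
  'newline', 'indentedCode', 'fencedCode', 'blockquote', 'atxHeading',
  'thematicBreak', 'list', 'setextHeading', 'html', 'footnote',
  'definition', 'table', 'paragraph'
];

// Hypothetical helper: remove `name` from the run order if it is present.
function turnOff(methods, name) {
  var index = methods.indexOf(name);

  if (index !== -1) {
    methods.splice(index, 1);
  }

  return methods;
}

turnOff(blockMethods, 'indentedCode');

console.log(blockMethods.indexOf('indentedCode')); // -1
console.log(blockMethods.length); // 12
```

The guard on the index matters: splicing at -1 would remove the last entry, paragraph, instead.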
License
MIT © Titus Wormer